PROCEDURE AND SYSTEM FOR THE GENERATION OF EXTRACTIVE TEXT SUMMARIES USING UNSUPERVISED DEEP LEARNING AND AUTOENCODING
Patent abstract:
Procedure and system for the generation of extractive text summaries using unsupervised deep learning and autoencoders. An automated procedure and system for extractive text summarization using unsupervised deep learning and autoencoders is described. The procedure uses deep machine learning to encode the text contained in the documents through sentence embedding techniques, and then encodes the result into a lower-dimensional vector representation using a deep autoencoder network. From the original text, the embedded sentences and the lower-dimensional vector representation, a relevance measure, a novelty measure and a position measure are computed for each sentence, respectively. From these three measures, the sentences are ordered and selected, according to their final score or their frequency of appearance in the original document, to form part of the final summary document.

Publication number: ES2716634A1
Application number: ES201831222
Filing date: 2018-12-14
Publication date: 2019-06-13
Inventors: Akanksha Joshi; Eduardo Fidalgo Fernández; Enrique Alegre Gutiérrez; Laura Fernández Robles
Applicant: Universidad de León
Patent description:
[0001] [0002] [0003] [0004] OBJECT OF THE INVENTION

[0005] The object of the present invention is an automated procedure and system for generating extractive text summaries using unsupervised deep learning and autoencoders. The invention makes it possible to summarize a document in an extractive way, that is, to select the most relevant fragments of the document and form a smaller document that identifies its textual content. Said smaller document would allow a user to know the subject or content of an extensive text document without reading it in full.

[0006] [0007] BACKGROUND OF THE INVENTION

[0008] With the arrival of the Internet and the large amount of data available, the number of texts and documents with textual content has increased significantly. In order to manage the information contained in these documents, there is a need for a smaller representation of them that collects the fundamental information, that is, a summary. Automatic text summarization is an important branch of natural language processing that aims to represent long text documents in a compressed form so that the most relevant information can be understood and quickly identified by end users.

[0009] [0010] There are two types of text summaries: extractive and abstractive (Gambhir, M., & Gupta, V. (2017). Recent automatic text summarization techniques: a survey. Artificial Intelligence Review, 47(1)). Extractive summarization concatenates the most relevant sentences of the document to produce the summary. As an alternative, an abstractive summary can be produced, where exact sentences of the document itself are not used; instead, a summary paraphrasing the main contents of the document is generated using natural language generation techniques.

[0011] [0012] [0013] There are traditional text summarization techniques based on the combination of statistical and linguistic features, such as term frequency (Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development; Nenkova, A., & Vanderwende, L. (2005). The impact of frequency on summarization. Technical report, Microsoft Research) or the length and position of the sentence, among others. In these methods, a score is assigned to each sentence according to its features. Sentences are then chosen to be part of the final summary using graph-based approaches (Erkan, G., & Radev, D. (2004). LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 457-479) or optimization-based approaches (McDonald, R. (2007). A study of global inference algorithms in multi-document summarization. In Proceedings of the 29th European Conference on IR Research, 557-564), among others.

[0014] [0015] Currently, text summarization techniques have evolved towards deep learning algorithms, given their power and good results in multiple Natural Language Processing (NLP) problems. Despite this, large amounts of data are needed to train the network adequately, which is a disadvantage of supervised deep learning for the generation of text document summaries.
[0016] [0017] The present invention solves the problems presented by the methods of the prior art, such as the need for large amounts of documents for the training of the algorithms, by exploiting techniques that do not require labeled data for training, in particular an unsupervised deep learning approach based on autoencoders and sentence embeddings, through deep learning networks previously trained on a predefined data set.

[0018] [0019] Obtaining large amounts of data for training a deep learning algorithm to summarize text documents presents a number of drawbacks. First, it is necessary to have a large number of documents summarized extractively and manually by a person. Second, in data sets containing text summaries it is common for each original document to have several associated summaries, each made by a human operator. In addition, the summary of a document depends to a great extent on the person who produces it, introducing subjectivity, which generates a disparity of content between the different summaries used to train the model. Finally, it is an expensive process due to the high costs associated with the time of the person who makes the summaries.

[0020] Due to the above problems in obtaining the data necessary to train a model for automatic text summarization using supervised deep learning, automatic text summarization using unsupervised deep learning is employed instead.

[0021] [0022] Several applications of Natural Language Processing are known that aim to improve the text summarization task by exploiting the capabilities of deep machine learning (Rush, A. M., Chopra, S., & Weston, J. (2015). A neural attention model for abstractive sentence summarization. In Proceedings of Empirical Methods in Natural Language Processing, 379-389; Nallapati, R., Zhou, B., dos Santos, C. N., Gulcehre, C., & Xiang, B. (2016). Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the SIGNLL Conference on Computational Natural Language Learning, 280-290).

[0023] [0024] DESCRIPTION OF THE INVENTION

[0025] [0026] The object of the present invention is an automated procedure and system for generating extractive text summaries using unsupervised deep learning and autoencoders.

[0027] [0028] Autoencoders have previously been applied to single-document text summarization, but trained by representing the input text document with Term Frequency - Inverse Document Frequency (TF-IDF) vectors, which completely ignore the order of the words in the text. One of the main advantages of using autoencoders is that they can learn, in this case which set of sentences best summarizes the document, in an unsupervised way.

[0029] [0030] The automated procedure and system for generating extractive text summaries using unsupervised deep learning and, preferably, autoencoders according to the present invention makes it possible to produce the extractive summary of a text document, whether it is obtained from the network through an internet connection or transferred to a computer through a removable or mass storage device.
[0031] [0032] Compared with the manual summary produced by an expert, the automatic summarization of documents cancels subjectivity, errors due to fatigue and lack of attention, the disparity of criteria among experts and the costs associated with the expert's time, and decreases the time necessary for producing the summary. For this reason, this procedure can be implemented in tools used by companies and State Security Forces (FFCCSSEE) to summarize any type of document with textual content, connected to the network or in isolation, accessing documents through removable mass storage media.

[0033] [0034] The present invention can also be applied to the generation of data sets in an unsupervised manner, that is, summaries of text documents that could later be used in the training of supervised deep learning algorithms. The availability of large sets of document summaries would allow the training of more robust and reliable document summarization systems, which in turn would yield more precise summaries of documents.

[0035] [0036] In an exemplary embodiment, the system performs a sentence embedding phase, the output of which can be used to train an encoder that converts said sentences into embedded vectors, for example by means of the "Skip-Thoughts" methodology (Kiros, R., et al. Skip-thought vectors. arXiv:1506.06726v1, June 22nd, 2015). This makes it possible to map sentences that are semantically and syntactically similar to similar vector representations. Given any sentence, its representative vector is constructed using the sentences close to it, because they are considered to provide rich semantic and contextual information. The representation of the sentences in the embedding space causes sentences with a similar meaning to be represented by similar vectors.

[0037] [0038] The problem of generating automatic summaries can be considered as a problem of ordering or selecting sentences. The present invention contemplates a method for generating the summary of a text document obtained, for example, from the internet, comprising the steps of:
[0039] - obtaining the text document by means of a processor;
[0040] - obtaining a series of sentences from the text document;
[0041] - encoding the series of sentences by means of autoencoders, obtaining a series of encoded sentences;
[0042] - assigning a relevance measurement to each of the encoded sentences;
[0043] - assigning a novelty measurement to each of the encoded sentences;
[0044] - assigning a position measurement to each of the encoded sentences;
[0045] - from a combination of the relevance, novelty and position measurements, assigning a global score to each of the encoded sentences;
[0046] - selecting the sentences to be included in the summary from the global score of the encoded sentences.

[0047] [0048] In an exemplary embodiment, obtaining a series of sentences from the text document is carried out by means of a model constructed with an unsupervised algorithm.

[0049] [0050] On the other hand, the encoded sentences may correspond, for example, to a series of embedded vectors obtained using recurrent neural networks. Preferably, the encoding of the sentences is done using the Skip-Thought methodology.

[0051] [0052] In a preferred embodiment, the method comprises obtaining an original latent representation of the document by concatenating the encoded sentences.
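By way of illustration (not part of the claimed subject matter), a minimal Python sketch of the embedding and concatenation steps just described is given below. The function `embed_sentence` is a hypothetical stand-in for a pre-trained sentence encoder such as Skip-Thoughts; any model mapping a sentence to a fixed-length vector would play the same role, and the hash-seeded random vector used here only keeps the sketch self-contained and runnable.

```python
import numpy as np

def embed_sentence(sentence: str, dim: int = 2400) -> np.ndarray:
    # Hypothetical stand-in for a pre-trained sentence encoder such as
    # Skip-Thoughts (which yields 2400-dimensional vectors); a hash-seeded
    # random unit vector keeps the sketch runnable without the real model.
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def document_latent(sentences: list[str]) -> np.ndarray:
    # Concatenate the per-sentence embeddings into one latent representation
    # of the document; the invention further compresses this concatenation
    # with the encoder half of a trained autoencoder.
    return np.concatenate([embed_sentence(s) for s in sentences])

d_hat = document_latent(["A first sentence.", "A second, different sentence."])
```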
[0053] [0054] As to the relevance measurement of each sentence, said measure can be obtained by several methods, for example based on the cosine similarity between an original latent representation of the text document and a modified latent representation of the text document, the modified latent representation being obtained by eliminating the sentence whose relevance is to be measured.

[0055] [0056] On the other hand, the novelty measurement can preferably be carried out by calculating the cosine similarity of the series of embedded vectors, obtaining an intermediate similarity value and, depending on that intermediate value, assigning the novelty measurement. In one embodiment, the intermediate similarity value is calculated from the maximum value of cosine similarity between the embedded vectors. In another embodiment, the novelty measurement is 1 if the intermediate value is less than a predetermined threshold value. Finally, the novelty measurement can be defined as equal to 1-V, where V is the intermediate value, if the intermediate value is higher than the threshold value.

[0057] The position measurement of each sentence can be made, for example, taking into account the position of the sentence within the text document, as well as the number of sentences of the text document.

[0058] [0059] Preferably, the relevance measurement comprises: generating a reference vector based on the series of sentences; generating a comparison vector for each sentence, in which the comparison vector of each sentence corresponds to the reference vector after eliminating the parts of the reference vector that correspond to the sentence; and calculating the relevance measurement based on a computation of cosine similarity between the reference vector and each comparison vector. More preferably, the reference vector is obtained from the aggregation of elements of the embedded vectors; in particular, the reference vector can be obtained from an autoencoder trained with the series of embedded vectors.

[0060] [0061] In a particular embodiment, the selection of the sentences to be arranged in the summary comprises: organizing the sentences according to the global score and selecting the sentences that are above a predetermined threshold score. Preferably, the selection of the sentences to be arranged in the summary comprises: organizing the sentences according to the global score and selecting the first X sentences, X being a predetermined number of sentences.

[0062] [0063] In an embodiment of the present invention, the text document is obtained from an external storage medium selected from: a ROM memory, a CD-ROM memory or a semiconductor ROM, a USB, SD, mini-SD or micro-SD flash memory, a magnetic recording medium, a hard disk or a solid state memory.
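Before turning to the system, a minimal sketch of the novelty measurement described above is given, following the variant in which the intermediate value is the maximum cosine similarity of a sentence to any other embedded vector, the measurement is 1 below a threshold and 1-V above it. The threshold value of 0.95 is an assumption chosen for illustration; the invention leaves this value as a predetermined parameter.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def novelty_scores(embedded: list[np.ndarray], threshold: float = 0.95) -> list[float]:
    # Intermediate value V: maximum cosine similarity of each sentence to any
    # other sentence. Novelty is 1 if V is below the threshold (the sentence
    # is new) and 1 - V otherwise (redundant sentences score low).
    scores = []
    for i, v_i in enumerate(embedded):
        sims = [cosine_sim(v_i, v_j) for j, v_j in enumerate(embedded) if j != i]
        v = max(sims, default=0.0)
        scores.append(1.0 if v < threshold else 1.0 - v)
    return scores
```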
[0064] [0065] Furthermore, the present invention provides a system for generating a summary from a text document, comprising means of accessing a text document and a processor configured to:
[0066] - obtain the text document;
[0067] - obtain a series of sentences from the text document;
[0068] - assign a novelty measurement to each of the sentences;
[0069] - assign a relevance measurement to each of the sentences;
[0070] - assign a position measurement to each of the sentences;
[0071] - from the novelty, relevance and position measurements, assign a global score to each of the sentences; and
[0072] - select the sentences to be included in the summary from the global score of the sentences;
[0073] wherein the novelty measurement comprises encoding, through the processor, the sentences to obtain a series of embedded vectors; calculating the cosine similarity of the series of embedded vectors, obtaining an intermediate similarity value; and, depending on the intermediate similarity value, assigning the novelty measurement.

[0074] [0075] Preferably, the encoding of the sentences to obtain the series of embedded vectors is done through the Skip-Thought methodology.

[0076] [0077] In addition, the processor may be configured, for example, to:
[0078] - assign a relevance measurement to each of the sentences; and
- assign the global score according to the relevance measurement, in which the relevance measurement comprises: generating a reference vector based on the series of sentences; generating a comparison vector for each sentence, in which the comparison vector of each sentence corresponds to the reference vector after eliminating the parts of the reference vector corresponding to the sentence; and calculating the relevance measurement based on a computation of cosine similarity between the reference vector and each comparison vector.

[0079] [0080] The processor may preferably be configured to:
[0081] - assign a position measurement; and
[0082] - assign the global score according to the position measurement;
[0083] wherein the position measurement is calculated according to the relative position of the sentence with respect to the document.

[0084] [0085] In addition, the present invention contemplates a program product comprising program instruction means for carrying out the procedures described above when the program is executed in a processor and, likewise, a program product stored in a program support medium.

[0086] [0087] In a particularly preferred embodiment, the method of the invention calculates (i) a measure of the position of the sentence with respect to the text, (ii) a measure of the novelty of the sentence as a function of the similarity between embedded vectors and (iii) a measure of the relevance of the sentence. These measures can be combined to give a final score to each sentence using a weighted fusion of the measures.

[0088] [0089] Once the scores of all the sentences of the input document have been obtained, the method of the invention selects the sentences with the highest scores to represent the summary of the document. This selection can be made in several ways: (a) sorting the sentences in descending order of their scores and selecting the first sentences until reaching a pre-established number of sentences, (b) selecting all the sentences whose global score is above a predetermined threshold, or (c) ordering the sentences based on their frequency of appearance in the input document.
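A sketch of the sentence selection just described, covering variants (a) and (b); all names are illustrative. Returning the chosen sentences in their original document order is a design choice for readability, not something the invention prescribes.

```python
def select_sentences(sentences, scores, num_sentences=None, min_score=None):
    # Rank sentence indices by global score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    if num_sentences is not None:          # variant (a): fixed number of sentences
        chosen = ranked[:num_sentences]
    else:                                  # variant (b): score threshold
        chosen = [i for i in ranked if scores[i] >= min_score]
    # Emit in original document order so the summary reads naturally.
    return [sentences[i] for i in sorted(chosen)]
```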
[0090] [0091] In a preferred embodiment of the invention, this method is applied to any type of textual document, whether downloaded from the Web or supplied to the system through an external storage device of any type.

[0092] [0093] An example of the automated method and system for producing extractive text summaries using unsupervised deep learning and autoencoders of the present invention comprises the following steps:

[0094] [0095] 1. Obtaining a text document. This can be done in an online mode, through a computer with an internet connection, or in an offline mode, obtaining the text document through an external storage device.

[0096] [0097] 2. Embedding the sentences of the text document: the sentences of the input document are transformed into a series of embedded vectors of fixed length, so that sentences with similar meanings will have similar vector representations and, conversely, sentences with different meanings will have different vector representations. In a preferred embodiment of the invention, the "skip-thought" vector methodology is used to perform this embedding.

[0098] [0099] 3. Encoding of the embedded sentences: in a preferred embodiment of the invention, a process of encoding the embedded sentences is carried out, the embedded vectors being converted into a vector representation of smaller dimension called the reference vector. In a preferred embodiment, an autoencoder network is designed to obtain the reference vector by feeding it with the embedded vectors resulting from the previous phase. These embedded vectors are combined into embedded text units that are used to train the autoencoder network. In a preferred embodiment of the invention, once the network is trained, only its encoder part is used to generate representations of textual units, the combination of which gives rise to the original latent representation of the document.

[0100] [0101] 4. Calculation of the relevance measure of the sentence: once the reference vector, corresponding to the original latent representation of the document, has been obtained, a series of comparison vectors is calculated; these are modified latent representations of the document, one for each sentence contained in it. For this, the information corresponding to a sentence of the document is eliminated from the reference vector, thus generating the comparison vector corresponding to said sentence, that is, its modified latent representation. Then, to calculate the relevance measure of said sentence, the cosine similarity between the original latent representation (the reference vector) and the modified latent representation (the comparison vector) is computed. In a preferred embodiment, the relevance measure takes values between 0 and 1, a sentence being more relevant the closer the value of this measure is to one.

[0102] [0103] 5. Calculation of the novelty measure of the sentence: to calculate the novelty measure of a sentence, the cosine similarity between the embedded vectors corresponding to two sentences is computed. In a preferred embodiment, the resulting value will be between 0 and 1, a sentence being more novel the closer the value of this measure is to one.

[0104] [0105]
6. Calculation of the position measurement of the sentence: to calculate the measurement of the position of a sentence with respect to a document, the position occupied by said sentence within the original document is taken into account, as well as the number of sentences of the document. In a preferred embodiment, the resulting value will be between 1 and 0.5, the position value of the first sentence being 1 and decreasing for successive sentences.

[0106] [0107] 7. Calculation of the final score of each sentence: to order the sentences of a document according to the values of the relevance, novelty and position measures, the final score of each sentence of the original document is calculated. In a preferred embodiment of the invention, said value results from the weighted sum of the relevance, novelty and position measures.

[0108] [0109] 8. Selection of the sentences that will form the final summary of the document: once the final score of each sentence of the document has been calculated, the sentences that will constitute the summary of the original document are selected. In a preferred embodiment, this selection can be made in two different ways: (i) arranging the sentences according to their final score in descending order and choosing a specific number of sentences with the highest final score, or (ii) arranging the sentences according to their frequency of appearance in the original document.

[0110] [0111] BRIEF DESCRIPTION OF THE DRAWINGS
[0112] Below, a series of figures is described that helps to better understand the invention and that expressly relates to an embodiment of said invention, presented as a non-limiting example thereof.

[0113] [0114] Fig. 1 shows a simplified diagram of a system configured to carry out the method of the invention.

[0115] [0116] Fig. 2 shows an example of the conversion of the sentences s of a document D into embedded sentences S through the "skip-thoughts" vector space.

[0117] [0118] Fig. 3 shows an example of the conversion of the embedded sentences S into embedded text units T, subsequently reduced to latent textual units t, which are joined into the latent representation of the document D.

[0119] [0120] PREFERRED EMBODIMENT OF THE INVENTION

[0121] [0122] An example of a method according to the invention is described below, with reference to the appended figures. Figure 1 shows a simplified diagram of an example of a system for the automatic summarization of a text (1) arranged in a document. Said system can be implemented in a computer or any other data processing means, for example a desktop or laptop computer with at least one processor core, at least 8 GB of RAM and at least 16 GB of hard disk. The computer could obtain the text (1) from the network, for which it would need an Internet connection, but the automatic summarization of the text (1) could also be performed without an internet connection, on documents that are copied directly to the computer or stored in a memory accessible to it.

[0123] [0124] First, the system is configured to divide the obtained text into a series of sentences (2). Then, an embedding of the sentences (3) is performed, obtaining a series of embedded vectors; these vectors are transferred to an encoder (4) which, in turn, generates a reference vector and a series of comparison vectors (5), each of the comparison vectors being associated with one of the sentences of the series of sentences (2) obtained above.
Said encoder (4) can, in one embodiment, be configured to generate a vector representation of smaller dimension, for example by using autoencoders trained with the comparison vectors (5).

[0125] [0126] At this point, the three measurements forming part of a particularly preferred embodiment of the programmed algorithm are calculated using a method of the type disclosed by the present invention, that is, a measure of the position of the sentence (6), a measure of the novelty of the sentence (7) and a measure of the relevance of the sentence (8). With these three measures, a global score (9) can be calculated for each sentence and a selection of sentences (10) made based on said global score (9), which will result in the summary text (11) of the text (1). In this example of the method according to the invention, a summary text (11) is generated for each text (1) analyzed. In the following, each step of an example of a method according to the present invention is described.

[0127] [0128] To obtain the text (1) to be automatically summarized, the computer can be connected to the internet through a wireless connection or through an Ethernet network cable. Alternatively, the text (1) can be obtained through a support medium, which can be any entity or device capable of storing text documents. For example, the medium could include a storage medium such as a ROM memory, a CD-ROM memory or a semiconductor ROM, a USB, SD, mini-SD or micro-SD flash memory, a magnetic recording medium, for example a hard disk, or a solid state memory (SSD). The purpose of this network connection and configuration, or of the availability of support media of any kind, is to obtain the raw text necessary to produce the text (1) on which the extractive text summary will be made using the unsupervised deep learning and autoencoders of the present invention.

[0129] Once the text (1) is obtained, it is separated into sentences (2) using a classifier (for example, a function similar to those known in different programming languages as a "tokenizer") that uses an unsupervised algorithm to build a model for abbreviations, collocations and words that start sentences. Before it can be used, the model must be trained on a large collection of text in the language on which the sentence separation is to be performed.

[0130] Then, the embedding of the sentences (3) is carried out for each text (1) to be summarized. In this preferred embodiment of the invention, each sentence s of the input document is embedded into a vector S of 2400 dimensions using the "skip-thought" vector methodology. In this preferred embodiment of the invention, the model is based on an encoder-decoder network, where the encoder is a recurrent neural network (RNN) with Gated Recurrent Units (GRUs) (Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proceedings of the Deep Learning and Representation Learning Workshop: NIPS 2014, 1-9) and the decoder is a recurrent neural network (RNN) with conditional Gated Recurrent Units. In this preferred embodiment of the invention, the model is trained on the unlabeled data set called BookCorpus (Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S.
(2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, 19-27).

[0131] [0132] Figure 2 shows an example of original sentences s (2) obtained from a text document D (1) before embedding. As can be seen, after applying the "skip-thought" method, these sentences are embedded into embedded vectors S (3) that can be of fixed length, for example of 2400 elements. These embedded vectors are a numerical vector representation of the sentences in the text (1) that allows mathematical functions to be applied and, consequently, the summary generation process to be automated.

[0133] [0134] Once the embedded vectors (3) are obtained, they can be used to obtain a novelty measurement (7) by carrying out similarity calculations between them, as will be explained in more detail below.

[0135] [0136] In the next step, the encoding (4) of the embedded sentences S is performed, for example into a vector representation of smaller dimension called embedded textual units T. In a preferred embodiment, an autoencoder network is designed to obtain this vector representation of smaller dimension by feeding it with the embedded vectors S resulting from the previous phase. These embedded vectors are combined into embedded text units T that are used to train the autoencoder network. In a preferred embodiment of the invention, once the network is trained, only its encoder part is used to generate latent textual units t, the combination of which results in the original latent representation of document D. Figure 3 shows an example of how the embedded sentences S are converted into embedded text units T, subsequently reduced to latent textual units t, which are joined into the latent representation of document D.

[0137] [0138] In particular, Figure 3 shows how, from the embedded vectors (3) obtained in previous stages, a vector of embedded text units (40) is generated which is, basically, the combination of all the elements corresponding to the embedded vectors (3) in a single auxiliary vector. Said vector of embedded text units (40) is encoded, for example by means of autoencoders, to obtain a smaller vector (41) and to reduce the computational cost of the procedure. Once obtained, a reference vector (42) is created containing information corresponding to each of the obtained sentences (2). Finally, said reference vector (42) can be used to calculate a relevance measurement (8), as will be explained below.

[0139] [0140] After obtaining the latent representation of the document, or reference vector (42), the relevance measure (8) of each sentence is calculated. In a preferred embodiment of the invention, after obtaining the original latent representation of document D, said latent representation is used as the reference vector (42); then the modified latent representations of the document, one for each sentence contained in it, are calculated and used as comparison vectors. For this, a sentence s_i of the document is deleted and a latent representation mod_{D,s_i} of the document is generated in which said sentence is not included. Then, to calculate the relevance measure (8) of said sentence, the cosine similarity score_ContR(D, s_i) between the original latent representation of D and the modified latent representation mod_{D,s_i} is computed.
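A minimal sketch of this leave-one-out relevance computation is given below, assuming some `encode` function that plays the role of the trained encoder mapping a set of sentence embeddings to a latent document vector; the mean used as `toy_encode` is only a stand-in to keep the sketch runnable, not the autoencoder of the invention.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def toy_encode(embedded: list[np.ndarray]) -> np.ndarray:
    # Toy stand-in for the encoder half of the trained autoencoder: it must
    # map a list of sentence embeddings to one fixed-length latent vector.
    return np.mean(np.stack(embedded), axis=0)

def relevance_scores(embedded: list[np.ndarray], encode=toy_encode) -> list[float]:
    d_ref = encode(embedded)  # original latent representation (reference vector)
    scores = []
    for i in range(len(embedded)):
        d_mod = encode(embedded[:i] + embedded[i + 1:])  # sentence i removed
        scores.append(cosine_sim(d_ref, d_mod))          # score_ContR(D, s_i)
    return scores
```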
[0141] [0142] [0143] In a preferred embodiment of the invention, the relevance measure of a sentence takes values between 0 and 1, a sentence being more relevant the closer the value of this measure is to one.

[0144] [0145] The novelty measure (7) of a sentence preferably has a low value if the sentence is redundant or repetitive, and a high value if the sentence is new. In a particularly preferred embodiment, the cosine similarity Sim(s_i, s_j) between the embedded vectors (3) corresponding to each pair of sentences s_i and s_j (30) is calculated.

[0146] [0147] [0148] [0149] In a preferred embodiment of the invention, the novelty measure of a sentence in a document is calculated on the basis of the cosine similarity Sim(s_i, s_j) and the previously calculated measure score_ContR(D, s_i), N being the number of sentences of the document.

[0150] [0151] [0152] [0153] [0154] [0155] [0156] In a preferred embodiment, the resulting value score_Nov(D, s_i) will be between 0 and 1, a sentence being more novel the closer the value of this measure is to one.

[0157] [0158] The position measurement (6) of a sentence with respect to the text (1) is calculated taking into account the position that said sentence occupies within the original document, as well as the number of sentences of the document. In a preferred embodiment of the invention, the position measure of a sentence with respect to a document D is calculated as the maximum between 0.5 and an exponentially decaying expression, where exp represents the exponential function, N is the number of sentences in the document and p(s_i) is a function that supplies the relative position of the sentence in the document. In a preferred embodiment of the invention, p(s_1) = 1 for the first sentence. In a preferred embodiment, the resulting value will be between 1 and 0.5, the position value of the first sentence being 1 and decreasing for successive sentences.

[0159] [0160] [0161] [0162] [0163] [0164] [0165] Once the position, novelty and relevance measurements of the sentences have been obtained, the final score of each sentence (9) is calculated. In a preferred embodiment of the invention, said value results from the weighted sum of the three previous measures, each multiplied by its corresponding coefficient of relevance α, novelty β and position γ:

score_Final(D, s_i) = α · score_ContR(D, s_i) + β · score_Nov(D, s_i) + γ · score_Pos(D, s_i)

[0166] [0167] [0168] [0169] [0170] In a preferred embodiment of the invention, the values α, β and γ can take any value between 0 and 1 and are determined empirically. In a preferred embodiment of the invention, particularly preferred values are α = 0.45, β = 0.35 and γ = 0.20. However, other embodiments of the present invention may use any values that meet the requirement α > β > γ.

[0171] Finally, the selection of the sentences (10) that will be part of the summary text document (11) of the original text document is made. Score(D) denotes the ordered list of the final measurements obtained for the sentences of a document.

[0172] [0173] [0174] [0175] [0176] In a preferred embodiment of the invention, the relative ordering Rank(s_i) of each sentence within a document can be obtained by ordering its final measurement, [0177] [0178] [0179] [0180] where ε (epsilon) is a very small constant.
In a preferred embodiment, ε → 0+ takes a very small value and is used to resolve the possible situation where score_Final(D, s_i) = score_Final(D, s_j), allowing the position of a sentence to be prioritized.

[0181] [0182] In a preferred embodiment, the next step is to choose the sentences with the highest relative ordering to generate the summary (11) of the original text (1). In a preferred embodiment, the summary text document Summary(D, L) will contain L sentences. This selection of L sentences can be done in two different ways: (i) arranging the sentences according to their final score in descending order and choosing a specific number of sentences with the highest final score [0183] [0184] [0185] [0186] [0187], or (ii) arranging the sentences according to their frequency of appearance in the original document. [0188] [0189]
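Putting the pieces together, the sketch below computes a position measure, fuses the three measures with the preferred weights α = 0.45, β = 0.35 and γ = 0.20, and selects the top-L sentences with an epsilon tie-break that favours earlier positions. The exact decay of the position measure is not fully recoverable from the text; the exponential decay over the sentence index used here, with a floor of 0.5 and a value of 1 for the first sentence, is an assumption consistent with the stated range.

```python
import math

def position_scores(n: int) -> list[float]:
    # Assumed decay: 1 for the first sentence, falling towards a floor of 0.5.
    return [max(0.5, math.exp(-i / (3.0 * math.sqrt(n)))) for i in range(n)]

def final_scores(rel, nov, pos, alpha=0.45, beta=0.35, gamma=0.20):
    # Weighted fusion of relevance, novelty and position (alpha > beta > gamma).
    return [alpha * r + beta * n + gamma * p for r, n, p in zip(rel, nov, pos)]

def summarize(sentences, scores, L):
    eps = 1e-9  # tie-break: of two equal scores, the earlier sentence ranks higher
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i] - i * eps, reverse=True)
    return [sentences[i] for i in sorted(ranked[:L])]  # variant (i): top-L by score
```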
Claims:
Claims (24)

[1] 1. Procedure for generating a summary from a text document, comprising the steps of:
- obtaining the text document by means of a processor;
- obtaining a series of sentences from the text document;
- encoding the series of sentences through an encoder-decoder network, obtaining a series of embedded sentences;
- encoding the embedded sentences into a vector representation of smaller dimension using an autoencoder network, obtaining encoded sentences;
- assigning a relevance measurement to each of the encoded sentences;
- assigning a novelty measurement to each of the encoded sentences;
- assigning a position measurement to each of the encoded sentences;
- from a combination of the relevance, novelty and position measurements, assigning a global score to each of the encoded sentences;
- selecting the sentences to be included in the summary from the global score of the encoded sentences.

[2] 2. Method according to claim 1, characterized in that obtaining a series of sentences from the text document is carried out by means of a model constructed with an unsupervised algorithm.

[3] 3. Method according to any of claims 1 or 2, characterized in that the encoded sentences correspond to a series of embedded vectors obtained using recurrent neural networks.

[4] 4. Method according to any of claims 1 to 3, characterized in that the encoding of the sentences is carried out using the Skip-Thought methodology.

[5] 5. Method according to claim 1, characterized in that it comprises obtaining an original latent representation of the document by concatenating the encoded sentences.

[6] 6. Method according to claim 5, characterized in that the relevance measure of each sentence is obtained based on the cosine similarity existing between an original latent representation of the text document and a modified latent representation of the text document, the modified latent representation being obtained by eliminating the sentence whose relevance is to be obtained.

[7] 7. Method according to claim 1, characterized in that the novelty measurement is based on calculating the cosine similarity of the series of embedded vectors, obtaining an intermediate similarity value and, depending on the intermediate similarity value, assigning the novelty measurement.

[8] 8. Method according to claim 7, characterized in that the intermediate similarity value is calculated from the maximum value of cosine similarity between the embedded vectors.

[9] 9. Method according to claim 7, characterized in that the novelty measurement is 1 if the intermediate value is lower than a predetermined threshold value.

[10] 10. Method according to claim 7, characterized in that the novelty measurement is equal to 1-V, where V is the intermediate value, if the intermediate value is greater than the threshold value.

[11] 11. Method according to claim 1, characterized in that the position measurement of each sentence is carried out taking into account the position of the sentence within the text document, as well as the number of sentences of the text document.
[12] 12. Method according to any of the preceding claims, characterized in that the relevance measurement comprises: generating a reference vector based on the series of sentences; generating a comparison vector for each sentence, in which the comparison vector of each sentence corresponds to the reference vector after eliminating the parts of the reference vector corresponding to the sentence; and calculating the relevance measurement based on a computation of cosine similarity between the reference vector and each comparison vector.

[13] 13. Method according to claim 12, characterized in that the reference vector is obtained from the aggregation of elements of the embedded vectors.

[14] 14. Method according to claim 13, characterized in that the reference vector is obtained from an autoencoder trained with the series of embedded vectors.

[15] 15. Method according to any of the preceding claims, characterized in that the selection of the sentences to be arranged in the summary comprises: organizing the sentences according to the global score and selecting the sentences that are above a predetermined threshold score.

[16] 16. Method according to any of the preceding claims, characterized in that the selection of the sentences to be arranged in the summary comprises: organizing the sentences according to the global score and selecting the first X sentences, X being a predetermined number of sentences.

[17] 17. Method according to any of the preceding claims, characterized in that the text document is obtained through the internet.

[18] 18. Method according to any of claims 1 to 17, characterized in that the text document is obtained from an external storage medium selected from: a ROM memory, a CD-ROM memory or a semiconductor ROM memory, a USB, SD, mini-SD or micro-SD flash memory, a magnetic recording medium, a hard disk or a solid state memory.

[19] 19. System for generating a summary from a text document, comprising means of accessing a text document and a processor configured to:
- obtain the text document;
- obtain a series of sentences from the text document;
- assign a novelty measurement to each of the sentences;
- assign a relevance measurement to each of the sentences;
- assign a position measurement to each of the sentences;
- from the novelty, relevance and position measurements, assign a global score to each of the sentences;
- select the sentences to be included in the summary from the global score of the sentences;
characterized in that the novelty measurement comprises encoding, through the processor, the sentences to obtain a series of embedded vectors; calculating the cosine similarity of the series of embedded vectors, obtaining an intermediate similarity value; and, depending on the intermediate similarity value, assigning the novelty measurement.

[20] 20. System according to claim 19, characterized in that the encoding of the sentences to obtain the series of embedded vectors is carried out using the Skip-Thought methodology.

[21] 21. System according to any of claims 19 or 20, characterized in that the processor is configured to:
- assign a relevance measurement to each of the sentences; and
- assign the global score according to the relevance measurement,
wherein the relevance measurement comprises: generating a reference vector based on the series of sentences; generating a comparison vector for each sentence, in which the comparison vector of each sentence corresponds to the reference vector after eliminating the parts of the
reference vector that correspond to the sentence; and calculating the relevance measurement based on a computation of cosine similarity between the reference vector and each comparison vector.

[22] 22. System according to any of claims 19 to 21, characterized in that the processor is configured to:
- assign a position measurement; and
- assign the global score according to the position measurement;
wherein the position measurement is calculated according to the relative position of the sentence with respect to the document.

[23] 23. A program product comprising program instruction means for carrying out the method defined in any of claims 1 to 18 when the program is executed in a processor.

[24] 24. A program product according to claim 23, stored in a program support medium.
Patent family:
Publication number | Publication date
ES2716634B2 | 2020-11-26
Cited documents:
Publication number | Filing date | Publication date | Applicant | Title
WO2018046412A1 | 2016-09-07 | 2018-03-15 | Koninklijke Philips N.V. | Semi-supervised classification with stacked autoencoder
Legal status:
2019-06-13 | BA2A | Patent application published | Ref document: ES 2716634 A1 | Effective date: 2019-06-13
2020-11-26 | FG2A | Definitive protection | Ref document: ES 2716634 B2 | Effective date: 2020-11-26
Priority:
Application number | Filing date | Title
ES201831222A | 2018-12-14 | PROCEDURE AND SYSTEM FOR GENERATING EXTRACTIVE TEXT SUMMARIES USING UNSUPERVISED DEEP LEARNING AND AUTOENCODING
Sulfonates, polymers, resist compositions and patterning process
Washing machine
Washing machine
Device for fixture finishing and tension adjusting of membrane
Structure for Equipping Band in a Plane Cathode Ray Tube
Process for preparation of 7 alpha-carboxyl 9, 11-epoxy steroids and intermediates useful therein an
国家/地区
|